White Wine Analysis by Jeremy Meguira

The dataset analyzed in this report is a collection of approximately 5000 Portuguese ‘Vinho Verde’ white wines. There are 11 quantitative variables recorded for each wine that may or may not affect the variable we care most about, quality. In this analysis I hope to gain some insight into whether and how these variables affect the quality of the wine.

Univariate Plots Section

In this first section I look at summary statistics and then examine the distribution of each variable using histograms and boxplots.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Right away I can tell that there are some outliers in the fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates variables, as their maximum values lie far above the third quartile. Alcohol is a borderline case: its maximum of 14.2 sits just inside 1.5 × IQR above the third quartile (11.4 + 1.5 × 1.9 = 14.25). In addition, it’s a small but interesting note that on a quality scale of 0-10 no wine received a score of 0, 1, 2, or 10.
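To make the outlier check concrete, here is a minimal sketch that applies the common 1.5 × IQR boxplot rule to a few of the quartiles printed above. The rule itself is an assumption on my part (the summary output does not define what counts as an outlier), and the data frame name `stats` is just for illustration.

```r
# Quartiles and maxima copied from the summary() output above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "alcohol"),
  q1       = c(1.7, 0.036, 9.5),
  q3       = c(9.9, 0.050, 11.4),
  max      = c(65.8, 0.346, 14.2)
)

# Assumed rule (Tukey's boxplot convention): a value is a high outlier
# if it lies more than 1.5 * IQR above the third quartile.
stats$iqr            <- stats$q3 - stats$q1
stats$upper_fence    <- stats$q3 + 1.5 * stats$iqr
stats$max_is_outlier <- stats$max > stats$upper_fence

print(stats)
```

Under this rule the maxima of residual sugar and chlorides are flagged, while the alcohol maximum of 14.2 falls just below its fence of 14.25.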

In these histograms, the outliers show up as long, nearly empty stretches of the x axis for the variables in which they occur. We can also see that the bulk of the data falls between pH values of 2.8 and 3.6. This is surprising to me, as it shows that the range of pH values is relatively small. If there is a correlation between the pH of the wine and its quality, the acidity changes involved are very subtle.

I’m realizing that a lot of this data has a long right tail, which makes the histograms harder to read. This is backed up by the boxplot grid, where several of the variables have many data points beyond the whiskers. I’m going to apply some transformations to clean up these distributions. First I will generate skewness values using the moments package. From what I understand, values within [-1, 1] indicate at most moderate skew, and a perfectly symmetric distribution, such as the normal distribution, has a skewness of 0.
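For readers who want to see what a skewness value measures, here is a minimal base R sketch of the moment-based definition (third central moment divided by the second central moment to the power 3/2). I believe this is the definition the moments package uses, but treat that as an assumption; the function name `skewness_sketch` and the two small vectors are made up for illustration.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness_sketch <- function(x) {
  x  <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^(3 / 2)
}

# Hypothetical toy vectors, not taken from the wine data.
symmetric    <- c(1, 2, 3, 4, 5)
right_tailed <- c(1, 1, 1, 2, 10)

skewness_sketch(symmetric)     # 0: deviations above and below the mean cancel
skewness_sketch(right_tailed)  # positive: the long tail is on the right
```

A single large value on the right is enough to push the statistic well above 1, which is the pattern seen for several of the wine variables.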

Skew values under no transformation

##                    X        fixed.acidity     volatile.acidity 
##            0.0000000            0.6475531            1.5764965 
##          citric.acid       residual.sugar            chlorides 
##            1.2815278            1.0767639            5.0217922 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##            1.4063141            0.3905902            0.9774735 
##                   pH            sulphates              alcohol 
##            0.4576423            0.9768944            0.4871927 
##              quality 
##            0.1557487

It appears that volatile acidity, citric acid, residual sugar, chlorides, and free sulfur dioxide are all skewed to the right, as indicated by skewness values larger than 1. Logarithmic transformations may be necessary for multivariate analyses.

Skew values under logarithmic transformation

##                    X        fixed.acidity     volatile.acidity 
##          -1.94372852           0.07682765           0.13934046 
##          citric.acid       residual.sugar            chlorides 
##                  NaN          -0.16110754           1.13378629 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          -0.93603533          -0.98391453           0.93065675 
##                   pH            sulphates              alcohol 
##           0.29873113           0.23368576           0.31003964 
##              quality 
##          -0.40764675

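We can now evaluate the distribution of the variables under a logarithmic transformation. Volatile acidity (0.14) and residual sugar (-0.16) are much closer to symmetric under this transformation, and chlorides improves substantially (from 5.02 to 1.13), although it is still skewed to the right. Citric acid returns NaN because its minimum value is 0 and log(0) is not finite.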

Skew values under square root transformation

##                    X        fixed.acidity     volatile.acidity 
##          -0.56494566           0.35117701           0.78807976 
##          citric.acid       residual.sugar            chlorides 
##          -0.42671390           0.31610663           2.84993244 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##           0.04967674          -0.16387369           0.95371995 
##                   pH            sulphates              alcohol 
##           0.37748822           0.59208653           0.39776908 
##              quality 
##          -0.10970400

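For citric acid, the square root transformation appears to be the better option: it gives a skewness of -0.43, whereas the log transformation is undefined for wines with a citric acid value of 0.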
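The NaN for citric acid in the log-transformed skew table can be reproduced with a few made-up values. This is only a sketch: the vector `citric` is hypothetical apart from the zero, which mirrors the minimum of 0 shown in the summary output.

```r
# Hypothetical values; the real column has a minimum of 0 (see summary above).
citric <- c(0, 0.16, 0.32, 0.36, 0.40)

log(citric)        # first element is -Inf
mean(log(citric))  # -Inf, because one -Inf dominates the sum
sqrt(citric)       # all finite, so moment-based statistics are defined

# Deviations from the mean are what skewness is built from.
log_mean <- mean(log(citric))
log_dev  <- log(citric) - log_mean  # -Inf - (-Inf) gives NaN

print(log_dev)
```

Because one deviation is NaN, any average of powers of those deviations is NaN as well, which is why the square root is the safer choice for a variable that contains zeros.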

Volatile acidity looks much more normal under this transformation.

Residual sugar has a bimodal distribution under the log transformation, with peaks around 0.5 and 2.25. Interesting. I wonder why there is a dip around log(residual.sugar) = 1.1.

Looks good.

Looks good.

Univariate Analysis

What is the structure of your dataset?

This dataset contains 4,898 different wines with 1 id variable and 12 quantitative variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality).

What is/are the main feature(s) of interest in your dataset?

The most important variable here is quality as we are attempting to determine how the other 11 variables influence quality. I would like to look into how the different acidity variables interact/influence each other.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I’m also curious whether the sweet/salty (residual sugar/chlorides) levels of the wines are at all related to the pH values, as I would suspect that the more acidic a wine is, the sweeter it would have to be in order to counteract the acidity.

In the text file explaining the dataset, it says that the total sulfur dioxide is undetectable under 50 ppm and at higher levels will affect the nose and the taste of the wine. However, it doesn’t mention whether this affects the wine positively or negatively. This is certainly something we can look into as it appears that the vast majority of data points have total sulfur dioxide over 50ppm.

Did you create any new variables from existing variables in the dataset?

Not yet, although I may introduce new variables to categorize quality, pH, or sulfur dioxide if I feel that they will aid the analysis or make it more digestible.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, the bulk of the variables in this dataset had significant skew. Specifically, I transformed the distributions of volatile acidity, residual sugar, chlorides, and citric acid. The majority of the univariate section is spent finding appropriate transformations to reduce the skew so that the later statistical analysis and modeling stick more closely to the assumptions of inferential statistics.

Bivariate Plots Section

Looking at the results of our ggpairs utility, I can see that the strongest correlation with quality is that of alcohol content: a moderate positive correlation of ~.44. This makes me want to look at the alcoholic strength of wine vs. the quality. Further, I can see that there is a weak to moderate negative correlation between quality and volatile acidity, chlorides, free sulfur dioxide, and density. It’s certainly worth examining these relationships to see why this may be the case.

Some other noteworthy observations include:

- a strong correlation (~.84) between density and residual sugar
- strong correlation values for residual sugar and free/total sulfur dioxide

I suspect that these variables may be important should I attempt to generate a model in the future.
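The coefficients reported by ggpairs are Pearson correlations. As a rough illustration of how such a value is computed, the sketch below uses only the first five rows printed by str() near the top of the report, so the result will not match the full-data value of ~.84. The variable names `sugar` and `wine_density` are my own.

```r
# First five rows as printed by str() above (density is rounded there).
sugar        <- c(20.7, 1.6, 6.9, 8.5, 8.5)
wine_density <- c(1.001, 0.994, 0.995, 0.996, 0.996)

# Pearson's r: covariance scaled by both standard deviations.
r_manual <- cov(sugar, wine_density) / (sd(sugar) * sd(wine_density))

# Built-in equivalent; method = "pearson" is the default.
r_builtin <- cor(sugar, wine_density)

print(c(manual = r_manual, builtin = r_builtin))
```

Even on this tiny sample the two variables move together closely, which is in line with the strong relationship seen across the full dataset.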

First I’d like to examine how the various acidity variables affect pH. I want to know if it is possible for there to be changes in fixed/volatile acidity and citric acid without affecting the overall pH drastically.

According to these graphs, the pH is affected most by changes in fixed acidity. There is little to no correlation between the levels of volatile acidity or citric acid and pH.

Zooming out a little bit, I have made a scatter plot of pH and quality and a line plot showing mean quality for all pH levels. The scatterplot reinforces that the majority of the wines lie between pH values of 2.9 and 3.6. In the mean quality graph, we can see a positive trend that indicates a moderate correlation between lower acidity (higher pH) and quality. Towards the ends of the plot (< 2.9 and > 3.5) there is a significant amount of noise, which I believe can be accounted for by the relatively few data points in these ranges. For this reason, I have zoomed in on the quality plot, where we can see this positive correlation a little more clearly. It seems that there is a “sweet spot” between 2.9 and 3.6 where you want the pH value of your wine to lie such that it is not too acidic or too basic.

Here we have examined the mean quality of the wine vs. the alcohol content. The mean quality is represented in black, the median in orange, and the linear model is rendered in blue. We can see the moderate correlation. It is amusing to note that, as you might expect, the stronger a wine is, the better it is reviewed. Unfortunately, this graph is pretty noisy.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Here we can see the strong positive correlation between residual sugar and density. The more residual sugar, the sweeter and more dense the wine becomes. Interestingly, there is a moderate negative correlation between the density and the quality of the wine. Painting a bigger picture, one may assume that past a certain level of sweetness the quality may begin to drop. Let’s take a look at residual sugar and mean quality.

Here the data bears out my suspicion. I’ve used a rounding method to reduce the noise in the original plot. We can see the trend in residual sugar negatively affecting the quality. Now let’s look at similar graphs for density.
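The exact rounding method is not shown here, so the following is only a sketch of one way to do it in base R: round the predictor so that nearby wines share a bin, then average quality within each bin. The data frame `wines` and its values are hypothetical stand-ins for the real data.

```r
# Hypothetical mini data set standing in for the wine data frame.
wines <- data.frame(
  residual.sugar = c(1.2, 1.4, 1.6, 6.8, 7.1, 7.4, 12.6, 13.2),
  quality        = c(6,   7,   6,   6,   5,   6,   5,    5)
)

# Round the predictor to the nearest whole unit so nearby wines share a bin.
wines$sugar_bin <- round(wines$residual.sugar)

# One mean quality per bin instead of one point per distinct sugar value.
mean_by_bin <- aggregate(quality ~ sugar_bin, data = wines, FUN = mean)

print(mean_by_bin)
```

Coarser rounding gives smoother lines but hides detail within each bin, so the bin width is a trade-off rather than a fixed choice.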

The slightly stronger negative correlation is on display here as well for density. Finally, I’d like to look at total sulfur dioxide past 50ppm and how it affects the quality of a wine.

It appears that there is only a very small negative correlation here. Going back to the ggpairs graph, we can see that the value of the correlation is only about -.18 so this makes sense.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

This section was somewhat disappointing in that it did not turn up many strong relationships with quality. We are able to note a moderate positive correlation between alcohol and quality and weak to moderate negative correlations between quality and density, residual sugar, and total sulfur dioxide.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

After examining the various acidity-related variables, we found, surprisingly, that citric acid and volatile acidity have little, if any, impact on the overall pH of the wine. However, there is a moderate correlation between the fixed acidity and the pH of the wine.

What was the strongest relationship you found?

The strongest relationship that was found is between residual sugar and density. This is logical, as the mean density of wine is approximately that of water, around 1 g/cm^3, whereas the density of sugar is ~1.6 g/cm^3. As the sugar content of wine increases, the density increases as well.

Multivariate Plots Section

These graphs are identical to those in the bivariate plots section except that I’ve colored the points on a gradient based on quality. We can see that the colors all blend towards the middle range of the gradient in all areas of the graphs. This tells me that the quality is not drastically affected by these 3 variables; if it were, you would expect to see a concentration of higher or lower quality points in some regions.

Contrary to the previous plots, where we saw no quality grouping based on the pH or any of the acidity variables, here we can see a clear grouping of higher quality wines underneath the plotted trendline and a clear grouping of lower quality wines above it. This corresponds neatly to the Pearson coefficients generated by our ggpairs plot from earlier.

In this next section, I will build a linear model for the quality of wine. Here we must make sure to use the transformations from earlier so that the model can be constructed sticking as closely to the assumptions of inferential statistics as possible.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = df)
## m2: lm(formula = I(quality) ~ I(alcohol) + density, data = df)
## m3: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides), 
##     data = df)
## m4: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) + 
##     log(volatile.acidity), data = df)
## m5: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) + 
##     log(volatile.acidity) + total.sulfur.dioxide, data = df)
## m6: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) + 
##     log(volatile.acidity) + total.sulfur.dioxide + fixed.acidity, 
##     data = df)
## m7: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) + 
##     log(volatile.acidity) + total.sulfur.dioxide + fixed.acidity + 
##     pH, data = df)
## m8: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) + 
##     log(volatile.acidity) + total.sulfur.dioxide + fixed.acidity + 
##     pH + log(residual.sugar), data = df)
## m9: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) + 
##     log(volatile.acidity) + total.sulfur.dioxide + fixed.acidity + 
##     pH + log(residual.sugar) + sulphates, data = df)
## m10: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) + 
##     log(volatile.acidity) + total.sulfur.dioxide + fixed.acidity + 
##     pH + log(residual.sugar) + sulphates + sqrt(citric.acid), 
##     data = df)
## m11: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) + 
##     log(volatile.acidity) + total.sulfur.dioxide + fixed.acidity + 
##     pH + log(residual.sugar) + sulphates + sqrt(citric.acid) + 
##     free.sulfur.dioxide, data = df)
## 
## ===================================================================================================================================================================================
##                               m1            m2            m3            m4            m5            m6            m7            m8            m9           m10           m11       
## -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                2.582***    -22.492***    -23.060***    -38.416***    -32.004***    -44.664***    -45.179***     45.155***     55.177***     56.816***     52.219***  
##                             (0.098)       (6.165)       (6.151)       (5.999)       (6.266)       (6.464)       (6.484)      (11.218)      (11.369)      (11.396)      (11.408)    
##   I(alcohol)                 0.313***      0.360***      0.334***      0.380***      0.383***      0.399***      0.401***      0.301***      0.286***      0.282***      0.284***  
##                             (0.009)       (0.015)       (0.016)       (0.015)       (0.015)       (0.015)       (0.016)       (0.018)       (0.019)       (0.019)       (0.019)    
##   density                                 24.728***     24.954***     39.264***     32.577***     45.834***     46.631***    -45.114***    -55.241***    -56.965***    -52.371***  
##                                           (6.079)       (6.065)       (5.909)       (6.204)       (6.427)       (6.474)      (11.331)      (11.483)      (11.513)      (11.524)    
##   log(chlorides)                                        -0.196***     -0.140***     -0.153***     -0.153***     -0.154***     -0.097*       -0.099**      -0.105**      -0.102**   
##                                                         (0.040)       (0.038)       (0.038)       (0.038)       (0.038)       (0.038)       (0.038)       (0.038)       (0.038)    
##   log(volatile.acidity)                                               -0.615***     -0.630***     -0.642***     -0.645***     -0.658***     -0.649***     -0.634***     -0.599***  
##                                                                       (0.033)       (0.033)       (0.033)       (0.033)       (0.033)       (0.033)       (0.034)       (0.034)    
##   total.sulfur.dioxide                                                               0.001***      0.001**       0.001**       0.001*        0.000         0.000        -0.001     
##                                                                                     (0.000)       (0.000)       (0.000)       (0.000)       (0.000)       (0.000)       (0.000)    
##   fixed.acidity                                                                                   -0.100***     -0.107***     -0.027        -0.021        -0.027        -0.019     
##                                                                                                   (0.014)       (0.015)       (0.017)       (0.017)       (0.017)       (0.017)    
##   pH                                                                                                            -0.083         0.299***      0.274**       0.291**       0.303***  
##                                                                                                                 (0.081)       (0.089)       (0.089)       (0.090)       (0.089)    
##   log(residual.sugar)                                                                                                          0.238***      0.259***      0.261***      0.246***  
##                                                                                                                               (0.024)       (0.025)       (0.025)       (0.025)    
##   sulphates                                                                                                                                  0.492***      0.483***      0.491***  
##                                                                                                                                             (0.098)       (0.098)       (0.098)    
##   sqrt(citric.acid)                                                                                                                                        0.215*        0.200     
##                                                                                                                                                           (0.109)       (0.109)    
##   free.sulfur.dioxide                                                                                                                                                    0.004***  
##                                                                                                                                                                         (0.001)    
## -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                  0.190         0.192         0.196         0.250         0.252         0.260         0.260         0.275         0.278         0.279         0.283     
##   adj. R-squared             0.190         0.192         0.196         0.250         0.251         0.259         0.259         0.274         0.277         0.278         0.281     
##   sigma                      0.797         0.796         0.794         0.767         0.766         0.762         0.762         0.755         0.753         0.753         0.751     
##   F                       1146.395       583.290       398.940       408.099       329.683       286.787       245.969       231.479       209.545       189.087       174.909     
##   p                          0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood         -5839.391     -5831.127     -5818.842     -5649.559     -5643.429     -5616.372     -5615.848     -5568.008     -5555.523     -5553.586     -5541.504     
##   Deviance                3112.257      3101.773      3086.252      2880.126      2872.926      2841.359      2840.751      2785.798      2771.632      2769.441      2755.811     
##   AIC                    11684.782     11670.255     11647.685     11311.119     11300.858     11248.743     11249.695     11156.017     11133.046     11131.172     11109.008     
##   BIC                    11704.272     11696.241     11680.167     11350.098     11346.334     11300.716     11308.164     11220.983     11204.508     11209.131     11193.463     
##   N                       4898          4898          4898          4898          4898          4898          4898          4898          4898          4898          4898         
## ===================================================================================================================================================================================

Unfortunately, the R-squared values show that the model does not fit well: even the full model (m11) reaches only 0.283.
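The fit statistics in the table are tied together by simple formulas, and checking a couple of them by hand helps in reading the table. The sketch below recomputes sigma, AIC, and R-squared for m1 and m11 from the printed deviance and log-likelihood. It assumes the Deviance row is the residual sum of squares, which is the usual meaning for a linear model; the data frame `fits` is just scaffolding for the check.

```r
n <- 4898  # observations, from the N row of the table

# Values copied from the table for m1 (1 predictor) and m11 (11 predictors).
fits <- data.frame(
  model      = c("m1", "m11"),
  predictors = c(1, 11),
  deviance   = c(3112.257, 2755.811),
  loglik     = c(-5839.391, -5541.504)
)

# Residual standard error: sqrt(RSS / residual degrees of freedom);
# the intercept also uses up one degree of freedom.
fits$sigma <- sqrt(fits$deviance / (n - fits$predictors - 1))

# AIC = -2 * logLik + 2 * k, where k counts the coefficients
# (including the intercept) plus the estimated error variance.
fits$aic <- -2 * fits$loglik + 2 * (fits$predictors + 2)

# R-squared = 1 - RSS / TSS. TSS is recovered from m1's rounded
# R-squared of 0.190, so this value is only approximate.
tss <- fits$deviance[1] / (1 - 0.190)
fits$r_squared <- 1 - fits$deviance / tss

print(fits)
```

The recomputed values agree with the sigma, AIC, and R-squared rows to the printed precision. The same table shows AIC rising slightly from m6 (11248.743) to m7 (11249.695), which suggests that adding pH at that step did not improve the fit enough to justify the extra parameter.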

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

We found that the fixed acidity, volatile acidity, and citric acid levels of the wines seem to have a relatively small impact on the quality of the wine.

In addition, we found that white wines that were less sweet were in fact more highly rated by the reviewers that participated in generating this data. We also illustrated and explained why and how an increase in residual sugar is strongly associated with an increase in density. This was the strongest relationship that was uncovered in this analysis.

Were there any interesting or surprising interactions between features?

I was surprised to see that there was little to no effect of citric acid on the quality of wines. I would expect with a white wine that the refreshing, fruity flavors of citric acid would be highly desirable but apparently that is not the case.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I did attempt to create a linear model for the quality of the wine. The strength of this model is that it incorporates all of the variables that were provided in the dataset. The primary weakness of the model is that it explains little of the variation in quality: with all of the variables included, the best R-squared value achieved was .283, which leaves roughly 72% of the variance unexplained. This is far too low for a model that you would hope to use for prediction. I think that because all of the variables in this data set are varied and nuanced, you might expect that a linear approach would not be very fruitful. This is mentioned in the informational text included with the dataset. A more sophisticated model beyond my current level of analysis would be required to generate good predictions for the quality of white wines.


Final Plots and Summary

Plot One

Description One

This plot is important as it shows the strongest relationship to quality that we uncovered in this analysis. Ironically, it is one you might have guessed before doing any type of in-depth EDA. The stronger the wine, the more the reviewers seemed to like it. Although this is not a very nuanced insight, it is the strongest that I found. We can also see how rounding a variable can reduce the noise in a graph at the cost of some information. This rounding, however, makes the correlation much clearer to the eye.

Plot Two

Description Two

These graphs show a few things. First, we can see what we originally set out to discover with these plots: the pH appears to be independent of the volatile acidity and citric acid variables, while the fixed acidity does appear to have a large impact on the pH level of the wine. In addition, we can see from the color gradient that there isn’t a concentration of high or low quality wines in any region of the graph. This would imply that these variables do not have a significant impact on the quality variable. This is borne out by the coefficients generated in our ggpairs plot, which show very weak correlations between these variables and quality. The strongest correlation was fixed acidity with a coefficient of ~ -.2.

Plot Three

Description Three

Finally, we see here the strongest relationship that was uncovered by our analysis. As the level of residual sugar increases, the density of the wine increases in a strongly linear fashion. This makes complete sense, as sugar is denser than wine: the more of it there is, the denser the wine becomes. The residual sugar also reflects how sweet the wine is. It appears that higher quality white wines in general are less sweet and therefore less dense as well. This is also reflected in the graph by the color gradient, as you can see the high quality grouping below the trendline and the low quality grouping above the trendline.


Reflection

I feel that this dataset was challenging in that I knew very little about its variables before beginning this project. Although I gained a comfortable grasp of them as time progressed, the challenge was in figuring out how they relate to each other and how to analyze them. This, as with any data analysis, is the essential challenge of EDA. Additionally, the fact that all of the variables were quantitative made it difficult to facet or group graphs in any particular way. If I were to do the analysis again, I might group wines on variables like total sulfur dioxide and pH according to whether they fall within the sweet spot of not being offensive, e.g. wines within 2.9-3.5 pH being ‘normal’ vs. ‘too acidic’ vs. ‘too basic.’ I was also surprised to see that there were really no strong correlations between any of these variables and the quality of the wine. Finally, I was very optimistic when constructing my model that I might be able to generate some sort of predictive ability about the quality of wines. This turned out not to be the case, as the model I made is about as primitive as it gets. I also feel as though I was a little repetitive in the types of plots that I generated, but I struggled to think of new ways to illustrate the data that would be insightful.
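The pH grouping suggested above could be sketched with base R's cut(). The cut points 2.9 and 3.5 and the three labels come from the paragraph above; the pH values themselves are hypothetical, and which side of each boundary a wine falls on is an assumption of this sketch.

```r
# Hypothetical pH values; the cut points 2.9 and 3.5 come from the text above.
ph <- c(2.8, 3.0, 3.19, 3.5, 3.6)

# Intervals are closed on the right by default: (-Inf, 2.9], (2.9, 3.5], (3.5, Inf).
ph_group <- cut(
  ph,
  breaks = c(-Inf, 2.9, 3.5, Inf),
  labels = c("too acidic", "normal", "too basic")
)

print(table(ph_group))
```

The resulting factor could then be used to facet or color plots, which addresses the difficulty of grouping purely quantitative variables.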